Automatic extraction method of the structure of a video sequence

ABSTRACT

The invention relates to a method intended to automatically create a description of a video sequence (i.e. its table of contents) by means of an analysis of the sequence. The main steps of the method are a shot detection, a sub-division of these shots into sub-entities called micro-segments, and the creation of the final hierarchical structure of the processed sequence. According to the invention, the shot detection step computes the mean displaced frame difference curve, detects the highest peaks of said curve, removes, by filtering, some negative or positive peaks, extracts markers, and propagates them on the curve.

FIELD OF THE INVENTION

The invention relates to a method for an automatic extraction of the structure of a video sequence that corresponds to successive frames, comprising the following steps:

(1) a shot detection step, provided for detecting the boundaries between consecutive shots (a shot being a set of contiguous frames without editing effects) and using a similarity criterion based on a computation of the mean displaced frame difference curve and the detection of the highest peaks of said curve;

(2) a partitioning step, provided for splitting each shot into sub-entities, called micro-segments;

(3) a clustering step, provided for creating a final hierarchical structure of the processed video sequence.

The invention also relates to a corresponding method for indexing data, to a device for carrying out said method, and to an image retrieval system in which said method is implemented. The technique of the invention is particularly well-suited for use in applications related to the MPEG-7 standard.

BACKGROUND OF THE INVENTION

The future MPEG-7 standard is intended to specify a standard set of descriptors that can be used to describe various types of multimedia information. The description thus associated with a given content allows fast and efficient searching for material of a user's interest. The invention relates more specifically to the representation of video sequences, intended to provide users with modalities for searching information. For a video sequence, the goal of a table of contents description of this document is to define the structure of this sequence in a hierarchical fashion, similarly to what is done for books, in which texts are divided into chapters and paragraphs: the original sequence is subdivided into sub-sequences, which may be further divided into shorter sub-entities.

A method for defining such a structure is described in a European patent application previously filed by the Applicant with the number 99402594.8 (PHF99593). According to said document, the method is divided into three steps, which are, as shown in FIG. 1: a shot detection step 11 (in a sequence of pictures, a video shot is a particular sequence which shows a single background, i.e. a set of contiguous frames without editing effects), a partitioning step 12, for the segmentation of the detected shots into entities exhibiting consistent camera motion characteristics, and a shot clustering step 13.

Concerning the shot detection step, several solutions were already proposed in the document “A survey on the automatic indexing of video data”, R. Brunelli et al., Journal of Visual Communication and Image Representation, volume 10, number 2, June 1999, pp. 78-112. In the method described in the cited document, the first step 11 detects the transitions between consecutive shots by means of two main sub-steps: a computation sub-step 111, which determines a mean Displaced Frame Difference (mDFD) curve, and a segmentation sub-step 112.

The mDFD curve computed during the sub-step 111 is obtained taking into account both luminance and chrominance information. With, for a frame at time t, the following definitions:

$$\text{luminance } Y = \{f_k(i, j, t)\}_{k=Y} \qquad (1)$$

$$\text{chrominance components } (U, V) = \{f_k(i, j, t)\}_{k=U,V} \qquad (2)$$

the DFD is given by

$$\mathrm{DFD}_k(i, j;\ t-1, t+1) = f_k(i, j, t+1) - f_k(i - d_x(i, j),\ j - d_y(i, j),\ t-1) \qquad (3)$$

and the mDFD by:

$$\mathrm{mDFD}(t) = \frac{1}{I_x I_y} \sum_{k}^{Y,U,V} w_k \sum_{i,j}^{I_x I_y} \mathrm{DFD}_k(i, j;\ t-1, t+1) \qquad (4)$$

where $I_x$, $I_y$ are the image dimensions and $w_k$ the weights for the Y, U, V components. An example of the obtained curve (and of the corresponding filtered one), showing ten shots s₁ to s₁₀, is illustrated in FIG. 2 with weights that have been for instance set to $\{w_Y, w_U, w_V\} = \{1, 3, 3\}$. In this example, the highest peaks of the curve correspond to the abrupt transitions from one frame to the following one (frames 21100, 21195, 21633, 21724), while, on the other hand, the oscillation from frame 21260 to frame 21279 corresponds to a dissolve (a gradual change from one camera record to another one by simple linear combination of the frames involved in this dissolve process) and the presence of large moving foreground objects in frames 21100-21195 and 21633-21724 creates high level oscillations of the mDFD curve.
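As an illustration of relations (3) and (4), the following minimal Python sketch computes one mDFD sample from two frames and a per-pixel displacement field. The function name `mdfd`, the dictionary-of-lists frame layout, the clamping of displaced positions to the image border and the use of the absolute value of each difference (so that positive and negative errors do not cancel) are assumptions made for the example, not details taken from the cited document.

```python
def mdfd(prev, nxt, motion, weights):
    """Mean displaced frame difference between frames t-1 and t+1.

    prev, nxt: dict mapping a component name ("Y", "U", "V") to a 2-D list
               of pixel values indexed [i][j] (hypothetical layout).
    motion:    2-D list of (dx, dy) displacement vectors, one per pixel.
    weights:   dict mapping a component name to its weight w_k.
    """
    rows = len(motion)
    cols = len(motion[0])
    total = 0.0
    for k, w in weights.items():
        for i in range(rows):
            for j in range(cols):
                dx, dy = motion[i][j]
                # Displaced position in frame t-1, clamped to the image (assumption).
                pi = min(max(i - dx, 0), rows - 1)
                pj = min(max(j - dy, 0), cols - 1)
                dfd = nxt[k][i][j] - prev[k][pi][pj]   # relation (3)
                total += w * abs(dfd)                  # absolute value is an assumption
    return total / (rows * cols)                       # relation (4)
```

With the weights {1, 3, 3} quoted above, a chrominance difference contributes three times as much to the curve as a luminance difference of the same size.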

The sub-step 112, provided for detecting the video editing effects and segmenting the mDFD curve into shots, uses a threshold-based segmentation to extract the highest peaks of the mDFD curve (or another type of mono-dimensional curve), as described for instance in the document “Hierarchical scene change detection in an MPEG-2 compressed video sequence”, T. Shin et al., Proceedings of the 1998 IEEE International Symposium on Circuits and Systems, ISCAS′98, vol. 4, March 1998, pp. 253-256.

The second step 12 is a temporal segmentation provided for splitting each detected shot into sub-entities presenting a very high level of homogeneity on camera motion parameters. It consists of two sub-steps: an oversegmentation sub-step 121, intended to divide each shot into so-called micro-segments which must show a very high level of homogeneity, and a merging sub-step 122.

In order to carry out the first sub-step 121, it is necessary to define first what will be called a distance (the distances thus defined allow the micro-segments to be compared), and also a parameter for assessing the quality of a micro-segment or a partition (i.e. a set of micro-segments). In both cases, a motion histogram is used, in which each one of the bins shows the percentage of frames with a specific type of motion and which is defined by the following relation (5):

$$H_s[i] = \frac{N_i}{L_s} \qquad (5)$$

where s represents the label of the concerned micro-segment inside the shot, i the motion type (these motions are called trackleft, trackright, boomdown, boomup, tiltdown, tiltup, panleft, panright, rollleft, rollright, zoomin, zoomout, fixed), $L_s$ the length of the micro-segment s, and $N_i$ the number of frames of the micro-segment s with motion type i (it is possible that $\sum_i H_s[i] > 1$, since different motions can appear concurrently).
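A minimal Python sketch of relation (5) follows. It assumes, purely for illustration, that each frame of a micro-segment is represented by the set of motion types detected in it; the names `MOTION_TYPES` and `motion_histogram` are hypothetical.

```python
MOTION_TYPES = [
    "trackleft", "trackright", "boomdown", "boomup", "tiltdown", "tiltup",
    "panleft", "panright", "rollleft", "rollright", "zoomin", "zoomout", "fixed",
]

def motion_histogram(frames):
    """Relation (5): for each motion type i, H_s[i] = N_i / L_s.

    frames: list with one set of motion-type names per frame of the
            micro-segment (assumed representation).
    """
    length = len(frames)                                  # L_s
    return {m: sum(1 for f in frames if m in f) / length  # N_i / L_s
            for m in MOTION_TYPES}
```

Because a frame may carry several motion types at once, the bins of such a histogram can sum to more than 1, as noted above.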

A micro-segment is assumed to be perfectly homogeneous (or to have a very high level of homogeneity) when it presents a single combination of camera motion parameters along all its frames, or to be not homogeneous when it presents important variations on these parameters. The micro-segment homogeneity is computed on its histogram (relation (5)): if a micro-segment is perfectly homogeneous, the histogram bins are equal either to 0 (the considered motion does not appear at all) or to 1 (the motion appears on the whole segment), while if it is not, the bins can present intermediate values. The measure of the micro-segment homogeneity is then obtained by measuring how much its histogram differs from the ideal one (i.e. it is computed how much the bins of the histogram differ from 1 or 0). The distance corresponding to bins with high values is the difference between the bin value and 1; analogously, for bins with small values, the distance is the bin value itself. An example of histogram is shown in FIG. 3, the axes of which indicate for each motion type its proportion (i.e. motion presence): two motion types introduce some error because the motion does not appear in all the frames of the micro-segment (panleft PL and zoomin ZI), and two other ones (boomdown BD and rollright RR) introduce some error for the opposite reason.

Mathematically, the homogeneity of a micro-segment s is given by the relation (6):

$$H(s) = \sum_{i} e(i) \qquad (6)$$

where:

$e(i) = 1 - H_s[i]$ if $H_s[i] \geq 0.5$

$e(i) = H_s[i]$ if $H_s[i] < 0.5$

$H_s[i]$ = histogram of the micro-segment s

i = motion type.

The homogeneity of a shot S is then equal to the homogeneity of its micro-segments, weighted by the length of each of them, as illustrated by the following equation (7):

$$H(S) = \frac{1}{L(S)} \cdot \sum_{j=1}^{N} L_j \cdot H(s_j) \qquad (7)$$

where $L(S) = \sum_{j=1}^{N} L_j$ is the total length of the shot S and N is the number of micro-segments said shot contains (it may be noted that small values of H(S) correspond to high levels of homogeneity). The distance between two micro-segments $s_1$ and $s_2$ is then the homogeneity of the union of the micro-segments:

$$d(s_1, s_2) = H(s_1 \cup s_2) \qquad (8)$$
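Relations (6) to (8) can be sketched in a few lines of Python. The example below assumes, as an illustration only, that a micro-segment is a list of per-frame sets of motion types, that the threshold between "high" and "small" bins is 0.5 with the boundary value treated as high, and that the union of two neighbouring micro-segments is the concatenation of their frames. All function names are hypothetical.

```python
def histogram(frames):
    """Relation (5): fraction of frames showing each motion type."""
    types = set().union(*frames)
    return {m: sum(m in f for f in frames) / len(frames) for m in types}

def homogeneity(frames):
    """Relation (6): sum of bin errors; 0 means perfectly homogeneous."""
    return sum(1 - h if h >= 0.5 else h for h in histogram(frames).values())

def shot_homogeneity(segments):
    """Relation (7): length-weighted mean of micro-segment homogeneities."""
    total = sum(len(s) for s in segments)
    return sum(len(s) * homogeneity(s) for s in segments) / total

def distance(s1, s2):
    """Relation (8): homogeneity of the union of two micro-segments."""
    return homogeneity(s1 + s2)
```

For instance, three frames of pure panleft followed by one frame of pure zoomout form two perfectly homogeneous micro-segments, but their union has bins 0.75 and 0.25, hence a distance of 0.5.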

The initial oversegmentation sub-step 121 therefore oversegments the concerned shot in order to obtain a set of perfectly homogeneous micro-segments, which corresponds to the following relation (9):

H(s)=0, for every s included in S  (9)

An example of how to obtain this initial oversegmented partition is shown in FIG. 4, with motion types panleft (PL), zoomout (ZO) and fixed (FIX), s₁ to s₇ designating the micro-segments (camera motion parameters may be unknown for some frames: in this example, the last frames of the shot, i.e. the segment s₇, do not have any parameter associated).
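One simple way to obtain a partition satisfying relation (9) is to cut the shot every time the combination of camera motions changes from one frame to the next. The sketch below does this with `itertools.groupby`; the per-frame representation (a set of motion types, or `None` when the parameters are unknown) and the function name `oversegment` are assumptions made for the example.

```python
from itertools import groupby

def oversegment(frames):
    """Split a shot into maximal runs of frames sharing the same motion set.

    frames: list with, for each frame, a set of motion types, or None when
            no camera motion parameter is available (assumed representation).
    Returns a list of micro-segments, each a list of consecutive frames.
    """
    # groupby starts a new group each time the value differs from the previous one.
    return [list(run) for _, run in groupby(frames)]
```

Every resulting micro-segment presents a single combination of motions, so all its histogram bins equal 0 or 1; frames without parameters end up in a micro-segment of their own.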

The merging sub-step 122 comprises a computation operation, in which the distance between all neighbouring micro-segments (temporally connected) is computed using the equation (8) for selecting the closest pair of micro-segments (for possible merging during the next operation), followed by a fusion decision operation, in which, in order to decide if the selected pair of micro-segments will be merged, the homogeneity of the shot (according to the equation (7)) is computed, assuming that the minimum distance micro-segments have already been merged. The following fusion criterion is applied:

merge if H(S) ≤ threshold T(H)

do not merge if H(S) > threshold T(H)

(this fusion criterion is global: the decision depends on the homogeneity of the resulting partition, and not exclusively on the homogeneity of the resulting micro-segment). If the merging is done, a new iteration starts at the level of the second sub-step (a second computation operation is carried out, and so on). The merging process ends when there is no pair of neighbouring micro-segments that can still be merged.
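The merging loop can be sketched as follows. This is a minimal illustration under several assumptions: micro-segments are lists of per-frame sets of motion types, a bin equal to 0.5 is treated as a high bin, merging is allowed when the shot homogeneity does not exceed the threshold, and the loop stops as soon as the closest pair is refused. The function names are hypothetical.

```python
def homogeneity(frames):
    """Relation (6) computed from the motion histogram of relation (5)."""
    types = set().union(*frames)
    hist = [sum(m in f for f in frames) / len(frames) for m in types]
    return sum(1 - h if h >= 0.5 else h for h in hist)

def shot_homogeneity(segments):
    """Relation (7): length-weighted mean of micro-segment homogeneities."""
    total = sum(len(s) for s in segments)
    return sum(len(s) * homogeneity(s) for s in segments) / total

def merge_micro_segments(segments, threshold):
    """Iteratively merge the closest neighbouring micro-segments."""
    segments = [list(s) for s in segments]
    while len(segments) > 1:
        # Computation operation: closest pair of temporal neighbours, relation (8).
        k = min(range(len(segments) - 1),
                key=lambda n: homogeneity(segments[n] + segments[n + 1]))
        candidate = (segments[:k]
                     + [segments[k] + segments[k + 1]]
                     + segments[k + 2:])
        # Fusion decision: global criterion on the whole candidate partition.
        if shot_homogeneity(candidate) > threshold:
            break
        segments = candidate
    return segments
```

Since the decision is taken on the candidate partition as a whole, a short transitional micro-segment can be absorbed by a long neighbour at little cost, whereas merging two long and different micro-segments is refused.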

The third step 13 is divided into two sub-steps: a shot merging sub-step 131, in which pairs of shots are grouped together for creating a binary tree, and a tree structuring sub-step 132, for restructuring said binary tree in order to reflect the similarities present in the video sequence.

The shot merging sub-step 131 is provided for yielding a binary tree which represents the merging order of the initial shots: the leaves represent these initial shots, the top node the whole sequence, and the intermediate nodes the sequences that are created by the merging of several shots. The merging criterion is defined by a distance between shots, and the closest shots are first merged. In order to compute the distance between shots, it is necessary to define a shot model providing the features to be compared and to set the neighbourhood links between them (which indicate what merging can be done). The process ends when all the initial shots have been merged into a single node or when the minimum distance between all pairs of linked nodes is greater than a specified threshold.

The shot model must allow the content of several shots to be compared, in order to decide which shots must be merged and in which order. In still images, luminance and chrominance are the main features of the image, while in a video sequence motion is an important source of information due to the temporal evolution. So, average images, histograms of luminance and chrominance information (YUV components) and motion information will be used to model the shots.

For implementing the shot merging sub-step 131, it is necessary to carry out the following operations: (a) to get a minimum distance link (operation 1311); (b) to check a distance criterion (operation 1312); (c) to merge nodes (operation 1313); (d) to update links and distances (operation 1314); (e) to check the top node (operation 1315).

In the operation 1311, both the minimum and the maximum distance are computed for every pair of linked nodes. The maximum distance is first checked: if it is higher than a maximum distance threshold d(max), the link is discarded, otherwise the link is taken into account. Once all links have been scanned, the minimum distance is obtained.

In the operation 1312, in order to decide if the nodes pointed to by the minimum distance link must be merged, the minimum distance is compared to a minimum distance threshold d(min): if it is higher than said threshold, no merging is performed and the process ends, otherwise the pointed nodes are merged and the process goes on.

In the operation 1313, the nodes pointed to by the minimum distance link are merged. In the operation 1314, said links are updated to take into account the merging that has been done and, once links have been updated, the distance of those links which point to the new node is recomputed. In the final operation 1315, the number of remaining nodes is checked: if all initial shots have been merged into a single node, the process ends, otherwise a new iteration begins.
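The five operations can be sketched as a single loop. The example below is only an illustration under strong simplifying assumptions: each shot is modelled by one number instead of average images, histograms and motion information; only temporally adjacent nodes are linked; and the minimum and maximum distances between two nodes are taken over all pairs of shots they contain. The names `merge_shots`, `features`, `d_min` and `d_max` are hypothetical.

```python
def merge_shots(features, d_min, d_max):
    """Build the merging tree(s) of the initial shots.

    features: one number per initial shot (stand-in for the shot model).
    Returns a list of trees; a leaf is a shot index, an internal node is a
    (left, right) tuple. Several trees (a forest) may be returned.
    """
    # Each node is (tree, indices of the shots it contains).
    nodes = [(i, [i]) for i in range(len(features))]
    while len(nodes) > 1:                              # operation 1315
        best = None
        for n in range(len(nodes) - 1):                # operation 1311: scan the links
            dists = [abs(features[a] - features[b])
                     for a in nodes[n][1] for b in nodes[n + 1][1]]
            if max(dists) > d_max:
                continue                               # link discarded
            if best is None or min(dists) < best[0]:
                best = (min(dists), n)
        if best is None or best[0] > d_min:            # operation 1312
            break
        n = best[1]
        (left, left_shots), (right, right_shots) = nodes[n], nodes[n + 1]
        # Operations 1313 and 1314: merge; adjacency links follow from list order
        # and distances are recomputed at the next scan.
        nodes[n:n + 2] = [((left, right), left_shots + right_shots)]
    return [tree for tree, _ in nodes]
```

With features `[0, 1, 10, 11]`, a permissive pair of thresholds yields the single tree `((0, 1), (2, 3))`, while a small maximum distance threshold leaves a forest of two trees.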

The shot merging sub-step 131 may yield a single tree if all the initial shots are similar enough or a forest if initial shots are quite different. An example of binary tree for the creation of a table of contents is shown in FIG. 5. Inside the leaf nodes of this tree, its label and, between brackets, the starting and ending frame numbers of the shot have been indicated; inside the remaining nodes, the label, the fusion order (in parentheses) and the minimum and maximum distance between its two siblings.

The tree restructuring sub-step 132 is provided for restructuring the binary tree obtained in the sub-step 131 into an arbitrary tree that should reflect more clearly the video structure: the nodes that have been created by the merging process but that do not convey any relevant information are removed, said removal being done according to a criterion based on the variation of the similarity degree (distance) between the shots included in the node:

if the analyzed node is the root node (or one of the root nodes if various binary trees have been obtained after the merging process), then the node should be preserved and appear in the final tree;

if the analyzed node is a leaf node (i.e. corresponds to an initial shot), then it has also to remain in the final tree;

otherwise, the node will be kept in the final tree only if the following conditions (10) and (11) are satisfied:

|d(min)[analyzed node]−d(min)[parent node]|<T(H)  (10)

|d(max)[analyzed node]−d(max)[parent node]|<T(H)  (11)

As shown in FIG. 6, the tree resulting from the restructuring sub-step 132 represents more clearly the structure of the video sequence: nodes in the second level of the hierarchy (28, 12, 13, 21) represent the four scenes of the sequence, while nodes in the third (or occasionally in the fourth) level represent the initial shots.
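The three restructuring rules can be sketched recursively. The example below applies conditions (10) and (11) as they are written above, compares each node with its parent in the binary tree, and attaches the children of a removed node to the nearest kept ancestor; that last choice, the `Node` class and the function name `restructure` are assumptions made for the illustration.

```python
class Node:
    """Tree node; d_min and d_max are None for leaves (initial shots)."""
    def __init__(self, label, d_min=None, d_max=None, children=()):
        self.label = label
        self.d_min = d_min
        self.d_max = d_max
        self.children = list(children)

def restructure(node, threshold, parent=None):
    """Return the list of nodes that replace `node` in the final tree."""
    kids = [k for child in node.children
            for k in restructure(child, threshold, node)]
    is_root = parent is None
    is_leaf = not node.children
    keep = is_root or is_leaf or (
        abs(node.d_min - parent.d_min) < threshold        # condition (10)
        and abs(node.d_max - parent.d_max) < threshold)   # condition (11)
    if keep:
        return [Node(node.label, node.d_min, node.d_max, kids)]
    return kids   # node removed: its children move up one level
```

Calling `restructure(root, threshold)[0]` on a binary tree returns the root of the restructured tree, whose nodes may have more than two children.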

However, when implementing the method known from the cited document and recalled hereinabove, it may be noticed that this type of method is sometimes sensitive to noise, which makes it difficult to detect peaks of small contrast such as those corresponding to fading or special effects.

SUMMARY OF THE INVENTION

It is therefore an object of the invention to propose a more robust method for creating the description of a video sequence, in which said limitation is no longer observed.

To this end, the invention relates to a method such as defined in the introductory paragraph of the description and which is moreover characterized in that the shot detection step includes an additional segmentation sub-step, applied to said mean displaced frame difference curve and comprising the following operations:

(a) a first filtering operation, based on a structuring element removing the negative peaks the length of which is less than a predefined value (min);

(b) a second filtering operation, based on a contrast filter removing the positive peaks that have a positive contrast lower than a predefined value c;

(c) a marker extraction operation;

(d) a marker propagation operation.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will now be described, by way of example, with reference to the accompanying drawings in which:

FIG. 1 shows a block diagram of the definition method described in the cited European patent application;

FIG. 2 illustrates an example of mDFD curve for a given sequence of frames;

FIG. 3 shows an example of histogram illustrating the measure of the homogeneity;

FIG. 4 illustrates a sub-step of the above-described definition method;

FIG. 5 shows a binary tree such as created by implementation of a shot merging sub-step provided in said definition method;

FIG. 6 shows the tree yielded after a restructuring sub-step of said definition method;

FIG. 7 shows a block diagram of the definition method when the technical solution according to the invention is now implemented;

FIG. 8 illustrates a method for indexing data that have been processed according to the invention;

FIG. 9 illustrates an image retrieval system implementing said indexing method and allowing an image retrieval to be performed.

DETAILED DESCRIPTION OF THE INVENTION

It has been indicated hereinabove that the segmentation sub-step 112 extracts the highest peaks of the mDFD curve. Although a large number of shots can actually be detected by means of such an operation, it is more difficult to detect peaks of small contrast. The proposed technical solution is the replacement of said operation by a homogeneity-based approach relying on morphological tools. According to said solution, and as illustrated in FIG. 7 which shows a block diagram of the definition method when the technical solution according to the invention is implemented, four operations constituting an improved segmentation sub-step 70 are successively applied to the mDFD curve. This sub-step 70 replaces the previous sub-step 112 of FIG. 1.

The first operation is a simplification operation 71, carried out by means of a temporal filtering, in the present case a morphological closing with a mono-dimensional structuring element of length (min) equal to the duration of the shortest shot to be detected. With this operation, negative peaks the length of which is less than (min) frames are removed. The operation 71 is followed by another simplification operation 72, carried out by means of a positive contrast filter, the effect of which is to remove positive peaks that have a positive contrast lower than a given parameter c.
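The two simplification operations 71 and 72 can be sketched in pure Python on a curve stored as a list of numbers. The closing is a flat dilation followed by an erosion with windows clipped at the borders. For the positive contrast filter, the sketch uses the h-maxima transform (geodesic reconstruction of the curve lowered by c under the curve itself), which is one common way to build such a filter and is an assumption here, as are all function names. Note that this particular transform also lowers the surviving peaks by c.

```python
def closing(curve, length):
    """Flat 1-D morphological closing: dilation followed by erosion."""
    n = len(curve)
    dilated = [max(curve[i:i + length]) for i in range(n)]
    return [min(dilated[max(0, i - length + 1):i + 1]) for i in range(n)]

def reconstruct_by_dilation(marker, mask):
    """Geodesic reconstruction of `marker` under `mask` (marker <= mask)."""
    out = list(marker)
    for i in range(1, len(out)):                 # left-to-right pass
        out[i] = min(mask[i], max(out[i], out[i - 1]))
    for i in range(len(out) - 2, -1, -1):        # right-to-left pass
        out[i] = min(mask[i], max(out[i], out[i + 1]))
    return out

def positive_contrast_filter(curve, c):
    """h-maxima: peaks rising less than c above their surroundings vanish."""
    return reconstruct_by_dilation([v - c for v in curve], curve)
```

For example, `closing([5, 5, 1, 1, 5, 5, 5, 5], 3)` fills the two-frame valley, while a structuring element of length 2 leaves it untouched.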

A marker extraction operation 73 is then provided. Each marker, corresponding to the kernel of one shot, must cover a position of the curve with a high probability of belonging to a single shot. Because contiguous frames that belong to the same shot are quite similar, the value of the mDFD will be small around those frames. Thus, to extract the markers, a negative contrast filter (with the same parameter c as in the previous operation 72) is used for detecting each relative minimum of the curve. A final operation 74 propagates the markers on the curve until all points are assigned to a marker. This propagation process is performed by applying for instance the well-known watershed algorithm on the mDFD curve, using as initial markers those obtained in the previous operation 73.
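Operations 73 and 74 can be sketched as follows, again on a curve stored as a list of numbers. The negative contrast filter is implemented here as an h-minima transform (the h-maxima transform applied to the negated curve), and markers are the plateaus of the filtered curve that are lower than both neighbours. The propagation is a simplified stand-in for a watershed implementation: it relies on the observation that, on a one-dimensional curve, flooding from two adjacent markers meets at the highest point between them, and it assigns that point to the following shot. These choices and all function names are assumptions made for the illustration.

```python
def reconstruct_by_dilation(marker, mask):
    """Geodesic reconstruction of `marker` under `mask` (marker <= mask)."""
    out = list(marker)
    for i in range(1, len(out)):
        out[i] = min(mask[i], max(out[i], out[i - 1]))
    for i in range(len(out) - 2, -1, -1):
        out[i] = min(mask[i], max(out[i], out[i + 1]))
    return out

def extract_markers(curve, c):
    """Return (start, end) index ranges of the minima at least c deep."""
    neg = [-v for v in curve]
    # Negative contrast filter: minima shallower than c are filled in.
    filled = [-v for v in reconstruct_by_dilation([v - c for v in neg], neg)]
    markers, i, n = [], 0, len(filled)
    while i < n:
        j = i
        while j + 1 < n and filled[j + 1] == filled[i]:
            j += 1                                # extend the plateau
        higher_left = i == 0 or filled[i - 1] > filled[i]
        higher_right = j == n - 1 or filled[j + 1] > filled[i]
        if higher_left and higher_right:
            markers.append((i, j))
        i = j + 1
    return markers

def propagate(curve, markers):
    """Assign a shot label to every point of the curve."""
    starts = [0]
    for (_, left_end), (right_start, _) in zip(markers, markers[1:]):
        between = range(left_end + 1, right_start)
        starts.append(max(between, key=lambda t: curve[t]))  # highest point
    labels = [0] * len(curve)
    for label, start in enumerate(starts):
        for t in range(start, len(curve)):
            labels[t] = label
    return labels
```

On the curve `[1, 0, 1, 20, 2, 0, 3, 1, 3, 20, 0, 0]` with c=10, the shallow minimum at index 7 does not produce a marker of its own, and the two peaks of height 20 become the shot boundaries.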

In the example of the filtered curve of FIG. 2, the markers and detected shots were obtained with (min)=10 and c=10. Even though some oversegmentation appears around frames 21150 and 21700, both the scene cuts and the dissolve have been correctly detected. Such an oversegmentation is not a problem, because it will be eliminated during the next steps 12 and 13 of the method.

The invention is not limited to the implementation described above, from which modifications or broader applications may be deduced without departing from the scope of the invention. For instance the invention also relates to a method for indexing data that have been processed according to the previously described method. Such a method, illustrated in FIG. 8, comprises a structuring step 81, carrying out a sub-division of each processed sequence into consecutive shots and the splitting of each of the obtained shots into sub-entities (or micro-segments), and a clustering step 82, creating the final hierarchical structure. These steps 81 and 82, respectively similar to the steps 11-12 and to the step 13 previously described, are followed by an additional indexing step 83, provided for adding a label to each element of the hierarchical structure defined for each processed video sequence.

The invention also relates to an image retrieval system such as illustrated in FIG. 9, comprising a camera 91, for the acquisition of the video sequences (available in the form of sequential video bitstreams), a video indexing device 92, for carrying out said data indexing method (said device captures the different levels of content information in said sequences by analysis, hierarchical segmentation, and indexing on the basis of the categorization resulting from said segmentation), a database 93 that stores the data resulting from said categorization (these data are sometimes called metadata), a graphical user interface 94, for carrying out the requested retrieval from the database, and a video monitor 95 for displaying the retrieved information. The invention also relates to the video indexing device 92, which implements the method according to the invention.

What is claimed is:
1. A method for an automatic extraction of the structure of a video sequence that corresponds to successive frames, comprising the following steps: (1) a shot detection step, provided for detecting the boundaries between consecutive shots (a shot being a set of contiguous frames without editing effects) and using a similarity criterion based on a computation of the mean displaced frame difference curve and the detection of the highest peaks of said curve; (2) a partitioning step, provided for splitting each shot into sub-entities, called micro-segments; (3) a clustering step, provided for creating a final hierarchical structure of the processed video sequence; wherein said detection step includes an additional segmentation sub-step, applied to said mean displaced frame difference curve and comprising the following operations: (a) a first filtering operation, based on a structuring element removing the negative peaks the length of which is less than a predefined value (min); (b) a second filtering operation, based on a contrast filter removing the positive peaks that have a positive contrast lower than a predefined value c; (c) a marker extraction operation; (d) a marker propagation operation.
2. A method according to claim 1, wherein said marker extraction operation is implemented by means of a negative contrast filter using the same predefined value c.
3. A method according to claim 2, wherein said marker propagation operation is performed by applying the so-called watershed method.